 segmentation mask


0266e33d3f546cb5436a10798e657d97-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their encouraging and constructive comments. We are pleased that they find the paper well written and acknowledge the novelty and originality of the proposed task, which "has a potential to spark interest" (R1) and "may lead to future papers studying it" (R2). Regarding the proposed framework, R1 and R2 not only find it "sound" and "novel" but also stress the "re-implementation ease" from which "practitioners may benefit" (R1). Still, the reviewers raise points of improvement (R1, R3) and suggest a discussion about a related task (R2). We carefully address these comments below.


Supplementary Materials: AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation

Neural Information Processing Systems

The series is directed by David Yates and distributed by Warner Bros. It consists of three fantasy films as of 2022: Fantastic Beasts and Where to Find Them (2016) [1]. The movie follows Newt Scamander, a magizoologist who travels to New York with a suitcase full of magical creatures. When some of the creatures escape, he teams up with a group of people to find them before they cause any harm.


details

Neural Information Processing Systems

A.1 MONet. To segment each $w \times h$ frame $F_t$ into $N_o$ object representations, MONet uses a recurrent attention network to obtain $N_o$ attention masks $A_{ti} \in [0,1]^{w \times h}$ for $i = 1, \dots, N_o$ that represent the probability of each pixel in $F_t$ belonging to the $i$-th object, with [...]. This attention network is coupled with a component VAE with latents $z_{ti} \in \mathbb{R}^d$ for $i = 1, \dots, N_o$ that reconstructs $A_{ti} \odot F_t$, the $i$-th object in the image. The latent posterior distribution $q(z_{ti} \mid F_t, A_{ti})$ is a diagonal Gaussian with mean $\mu_{ti}$, and we use $\mu_{ti}$ as the representation of the $i$-th object. When these representations are fed into the transformer, we use a linear projection to map the raw object/word embeddings, which lie in $\mathbb{R}^d$, to a vector in $\mathbb{R}^{d N_H}$, where $N_H$ is the number of self-attention heads. This step is necessary because the latent dimensionality of MONet, $d$, is generally less than $N_H$, whereas a transformer expects the embedding size to be divisible by $N_H$.

A.2 Self-supervised training. Recall that in the main text we wrote the auxiliary self-supervised loss as auxiliary loss = $\sum$ [...]. A comparison of these losses and the masking schemes is given in Figure 4. We also tested a few variations of the contrastive loss inspired by the literature, and tested all combinations of these variations.
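The linear projection described in A.1 can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`make_projection`, `project`, `split_heads`), the random initialisation, and the toy sizes `d = 3` and `n_heads = 8` are all assumptions chosen for illustration. It only shows why mapping a vector from $\mathbb{R}^d$ to $\mathbb{R}^{d N_H}$ makes the embedding size divisible by the number of heads.

```python
import random


def make_projection(d, n_heads, seed=0):
    """Build a hypothetical (d * n_heads) x d weight matrix with random entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, d ** -0.5) for _ in range(d)]
            for _ in range(d * n_heads)]


def project(weights, x):
    """Apply the linear map y = W x to a single embedding vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def split_heads(y, n_heads):
    """Split an embedding into n_heads equal chunks, one per attention head."""
    if len(y) % n_heads != 0:
        raise ValueError("embedding size must be divisible by the number of heads")
    size = len(y) // n_heads
    return [y[i * size:(i + 1) * size] for i in range(n_heads)]


# Toy sizes (assumed): latent dimension d smaller than the number of heads.
d, n_heads = 3, 8
mu = [0.5, -1.0, 2.0]  # stand-in for one object representation of size d

W = make_projection(d, n_heads)
heads = split_heads(project(W, mu), n_heads)  # 8 chunks of size 3
```

Calling `split_heads(mu, n_heads)` directly on the unprojected vector raises a `ValueError`, since 3 is not divisible by 8; after projection the embedding has size 24 and splits evenly.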


TextDiffuser: Diffusion Models as Text Painters

Neural Information Processing Systems

Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality.



Video-to-Video Synthesis

Neural Information Processing Systems

We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video. While its image counterpart, the image-to-image translation problem, is a popular topic, the video-to-video synthesis problem is less explored in the literature. Without modeling temporal dynamics, directly applying existing image synthesis approaches to an input video often results in temporally incoherent videos of low visual quality. In this paper, we propose a video-to-video synthesis approach under the generative adversarial learning framework. Through carefully-designed generators and discriminators, coupled with a spatio-temporal adversarial objective, we achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses. Experiments on multiple benchmarks show the advantage of our method compared to strong baselines. In particular, our model is capable of synthesizing 2K resolution videos of street scenes up to 30 seconds long, which significantly advances the state-of-the-art of video synthesis. Finally, we apply our method to future video prediction, outperforming several competing systems. Code, models, and more results are available at our website: https://github.com/NVIDIA/vid2vid. (Please use Adobe Reader to see the embedded videos in the paper.)

